ABSTRACT 


An update and review of the research performed at Virginia 
Tech in the accuracy assessment of remotely sensed data during the 
past three years is given. This research included the use of 
discrete multivariate analysis techniques for the assessment of error 
matrices, the use of computer simulation for assessing various 
sampling strategies, and an investigation of spatial autocorrelation 
techniques. 


1.0 Introduction 


This report is an update and review of the research that was 
conducted at Virginia Tech on the accuracy assessment of remotely 
sensed data over the past three years. The majority of the report 
will briefly review the use of discrete multivariate analysis for 
assessing the accuracy of remotely sensed data. Wherever appropriate, 
a citation to more detailed information is given. The remainder of the 
report will discuss our continuing research in sampling simulation for 
accuracy assessment and the effects of spatial autocorrelation on 
accuracy. Wherever possible, preliminary results will be given. 

2.0 Discrete Multivariate Analysis Techniques for Accuracy Assessment 

Three discrete multivariate analysis procedures are used in 
the accuracy assessment of remotely sensed data. All three procedures 
operate on error matrices. An error matrix (Figure 1) is a square 
array of numbers set out in rows and columns which express the number 
of cells assigned as a particular land cover type relative to the 
actual cover type as verified in the field. The columns usually represent 
the reference data (ground verified) and the rows indicate either 
the Landsat classification or the photo interpretation. 

2.1 Review of the Normalization Procedure 

The normalization procedure is a method of standardizing each 
error matrix so that a direct comparison of individual cell values 
is possible. It is an iterative process (Bishop et al. 1975) by which 
the rows and columns of the matrix are successively balanced until 
each row and column sums to a given value (marginal). Therefore, each 
cell value is influenced by the omission and commission errors for 
that particular land cover category. After normalization, the cell 
values in corresponding positions of two or more error matrices can 
be compared without regard for differences in sample size between 
matrices. 
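
The row and column balancing described above can be sketched as 
follows. This is only an illustration of the iterative idea, not the 
MARGFIT program itself; the function name, the target marginal of 1.0, 
the tolerance, and the example matrix are all assumptions made for the 
sketch. 

```python
def normalize_matrix(matrix, target=1.0, tol=1e-6, max_iter=1000):
    """Iteratively balance rows and columns until each sums to `target`.

    Hypothetical helper: name, defaults, and stopping rule are assumptions.
    """
    cells = [[float(v) for v in row] for row in matrix]
    k = len(cells)
    for _ in range(max_iter):
        # Scale every row so that it sums to the target marginal.
        for i in range(k):
            row_sum = sum(cells[i])
            if row_sum > 0:
                cells[i] = [v * target / row_sum for v in cells[i]]
        # Scale every column so that it sums to the target marginal.
        for j in range(k):
            col_sum = sum(cells[i][j] for i in range(k))
            if col_sum > 0:
                for i in range(k):
                    cells[i][j] *= target / col_sum
        # Columns are exact after the column step, so only rows are checked.
        if all(abs(sum(row) - target) < tol for row in cells):
            break
    return cells


# Hypothetical 3-category error matrix
# (rows: classification, columns: reference data).
example = [[50, 5, 2], [4, 40, 6], [1, 8, 30]]
normalized = normalize_matrix(example)
```

After balancing, the diagonal cell of each category can be compared 
directly with the same cell of another normalized matrix, regardless 
of how many samples each original matrix contained. 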

A FORTRAN computer program called MARGFIT can be used for performing 
the normalization process (Congalton et al. 1981). For details 
and examples of this technique see Congalton (1981). 

2.2 Review of the Test of Agreement Procedure 

The test of agreement procedure is a method of testing the 
similarity or agreement between two or more error matrices (Bishop 
et al. 1981). This measure of agreement, called KHAT, is based on 
the difference between the actual agreement of the classification 
(i.e., agreement between the remote sensor data and the reference 
data) indicated by the diagonal cell values and the chance agreement 
which is indicated by the row and column marginals. 

KHAT values are calculated for each matrix and reflect how 
well the remote sensor data agrees with the reference data. A test 
can then be performed between two independent KHAT values in order 
to determine if they are significantly different, i.e., if one matrix 
is significantly different from another. The equations, more details, 
and examples can be found in Congalton (1981) and Congalton and Mead 
(1983). A FORTRAN computer program called KAPPA can also be used to 
perform this test of agreement procedure (Congalton 1981). 
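
As a rough illustration of how a KHAT value is obtained from an error 
matrix, the sketch below uses the commonly published count form of the 
kappa estimate, in which actual agreement comes from the diagonal and 
chance agreement from the products of the row and column marginals. 
The equations themselves are not reproduced in this report, so the 
formula, the function name, and the example matrix should be read as 
assumptions rather than as the KAPPA program. 

```python
def khat(matrix):
    """Kappa estimate for a square error matrix of counts.

    Hypothetical helper, assuming the usual count form:
    (N * sum(x_ii) - sum(x_i+ * x_+i)) / (N^2 - sum(x_i+ * x_+i)).
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement term built from the row and column marginals.
    chance = sum(row_totals[i] * col_totals[i] for i in range(k))
    return (n * diagonal - chance) / (n * n - chance)


# Hypothetical error matrix (rows: classification, columns: reference).
example = [[50, 5, 2], [4, 40, 6], [1, 8, 30]]
example_khat = khat(example)
```

A matrix with all counts on the diagonal gives a value of 1, and a 
matrix whose agreement is no better than chance gives a value of 0. 
The significance test between two KHAT values also needs their 
variances, which this sketch does not compute. 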

2.3 Review of the Multi-factor Comparison Procedure 

The multi-factor comparison procedure allows more than one 
factor affecting the classification accuracy to be examined at 
the same time. Appendix I contains a detailed explanation of this 
approach as well as a fully worked out example. For more examples 
and the APL computer program, CONTABLE, used to perform the 
computations see Congalton (1981). 

3.0 Sampling Simulation for Accuracy Assessment 

Monte Carlo computer simulation techniques were used to test 
the effects of various sampling schemes on the accuracy assessment 
of remotely sensed data. Three data sets of varying spatial complexity 
were used: a forested area, a rangeland area, and an agricultural 
area. Each area had two classified data sets associated with 
it. One of the data sets (usually photointerpretation) was assumed 
correct and the other was the Landsat classification. A difference 
image was then created for each area by comparing, pixel by pixel, 
the assumed correct data set with the Landsat classification. A 
difference image is a matrix of zeros and ones in which the zeros 
indicate agreement between the data sets and the ones indicate disagree- 
ment. In other words, the difference image is indicative of the 
pattern of error occurring in the classification. The population 
parameters (mean and variance) were computed from a total enumeration 
of each difference image and were used to compare with the sample 
statistics from the sampling simulations. 
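
The construction of a difference image and its population parameters 
can be sketched as below. The function names and the tiny example 
grids are hypothetical; the only details taken from the text are the 
pixel-by-pixel comparison, the coding of agreement as 0 and 
disagreement as 1, and the total enumeration of the image. 

```python
def difference_image(reference, classified):
    """Return a grid of 0 (agreement) and 1 (disagreement), pixel by pixel."""
    return [[0 if r == c else 1 for r, c in zip(ref_row, cls_row)]
            for ref_row, cls_row in zip(reference, classified)]


def population_parameters(image):
    """Mean and variance from a total enumeration of the difference image."""
    pixels = [v for row in image for v in row]
    n = len(pixels)
    mean = sum(pixels) / n
    # Population variance (divide by n), since every pixel is enumerated.
    variance = sum((v - mean) ** 2 for v in pixels) / n
    return mean, variance


# Hypothetical 2 x 3 data sets with made-up class labels.
reference = [["F", "F", "R"], ["F", "R", "R"]]
classified = [["F", "R", "R"], ["F", "R", "A"]]
diff = difference_image(reference, classified)
mean, variance = population_parameters(diff)
```

Because the image is binary, the mean is simply the proportion of 
misclassified pixels p, and the variance equals p(1 - p). 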

For a more detailed description of the data sets, as well as 
pictures of the difference images, see Congalton et al. (1982). 

3.1 Preliminary Results 

Simulations have been run for simple random sampling and 
cluster sampling. Testing of other sampling schemes is yet to 
be done. 

3.1.1 Intra-cluster Correlation Coefficients 

The intra-cluster correlation coefficient, ROH, is a measure of 
the homogeneity of the cluster. The more homogeneous the cluster, 
the greater the value of ROH. In order to maximize the given infor- 
mation within a cluster, one would wish the cluster to be as hetero- 
geneous as possible. Therefore, one would try to make ROH approach 
zero. Figure 2 shows a plot of average ROH vs. cluster size for each 
of the vegetation environments. 
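
The report does not state which estimator of ROH was used in the 
simulations. As one possible illustration, the sketch below uses the 
analysis-of-variance estimator for clusters of equal size, which 
compares the variation between cluster means with the variation 
within clusters. The function name and the choice of estimator are 
assumptions. 

```python
def intracluster_correlation(clusters):
    """ANOVA estimate of ROH for equal-size clusters of 0/1 pixel values.

    Assumed estimator: (MSB - MSW) / (MSB + (m - 1) * MSW), where m is the
    cluster size. Requires at least two clusters of at least two pixels.
    """
    k = len(clusters)
    m = len(clusters[0])
    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]
    # Mean square between clusters.
    ms_between = m * sum((cm - grand_mean) ** 2 for cm in cluster_means) / (k - 1)
    # Mean square within clusters.
    ms_within = sum((y - cm) ** 2
                    for c, cm in zip(clusters, cluster_means)
                    for y in c) / (k * (m - 1))
    denominator = ms_between + (m - 1) * ms_within
    if denominator == 0:
        return 0.0  # No variation at all; ROH is undefined, report zero.
    return (ms_between - ms_within) / denominator
```

With this estimator, clusters whose pixels are all alike internally 
give a value of 1, while clusters that each contain the full mix of 
correct and incorrect pixels give a value at or below zero. 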

3.2 Conclusions 

Visual inspection of the difference images revealed different 
levels of homogeneity or spatial complexity in the three vegetation 
environments. As expected, the agricultural area was most homogeneous 
while the forested area was most diverse. This result was 
also demonstrated in the plot of ROH vs. cluster size. Given that 
a large ROH means greater homogeneity, notice that for each cluster 
size, the agricultural environment had the largest ROH, while the forest 
had the smallest. 

Also demonstrated in the plot of ROH vs. cluster size were some 
guidelines for cluster size determination. Despite the theoretical 
statement that ROH should go to zero, it is apparent from this plot 
that this is not always practically feasible. Note from the plot that 
between 0 and 20 pixels/cluster, ROH decreases rather quickly while 
after this point, the line levels off. It can be concluded from this 
result that larger clusters taken to reduce ROH may gain little 
additional information while costing excessive time and money. 

3.3 Further Work 

Research is continuing at Virginia Tech in sampling simula- 
tion as well as in the spatial autocorrelation analysis techniques 
that will be discussed next. 

4.0 Spatial Autocorrelation Analysis 

Spatial autocorrelation analysis is a technique by which the 
pattern of a spatially distributed attribute can be investigated. 
In other words, if the presence of a given attribute in a certain 
location makes its presence in surrounding locations more or less 
likely, then the attribute is said to exhibit spatial autocorrelation 
(Cliff and Ord 1973). 

Spatial autocorrelation analysis can be used to investigate 
the pattern of error in the difference images created from a Landsat 
classification and an assumed correct reference set. In this 
situation, the discrete binary classification applies. Each pixel has 
either been classified correctly or incorrectly and therefore a technique 
called join count statistics may be used to measure the spatial auto- 
correlation. A join is defined if any two pixels have a boundary of 
positive non-zero length in common (Cliff and Ord 1973). 

In the case of a difference image, there are three possible 
joins: 0-0 (correct-correct), 1-1 (incorrect-incorrect), and 0-1 or 1-0 
(correct-incorrect, incorrect-correct). To test whether the pattern of 
error in the difference image differs significantly from random, we use 
the fact that the join count statistics are asymptotically normal. The 
mean and variance (first and second moments) are obtained using 
the equations in Appendix II and are compared with the observed counts 
to test for significance. 

There are two possible sampling schemes from which the statistics 
can be derived. These are free sampling (i.e., sampling with replacement) 
or non-free sampling (i.e., sampling without replacement). In the 
case of a difference image, non-free sampling is employed since we 
assume that each pixel has the same a priori probability of being 
right or wrong (Cliff and Ord 1973). 
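
A minimal sketch of the join count calculation for a difference image 
follows. It counts joins between pixels that share an edge and 
computes the expected number of each join type under non-free 
sampling using the standard expressions for that case. The equations 
of Appendix II are not reproduced here, so the expected-value formulas, 
the function names, and the example image are assumptions, and the 
variances needed for the significance test are left out. 

```python
def join_counts(image):
    """Count 0-0, 1-1, and 0-1 joins between edge-sharing pixels."""
    rows, cols = len(image), len(image[0])
    counts = {"00": 0, "11": 0, "01": 0}
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = image[r][c], image[rr][cc]
                    if a == b:
                        counts["00" if a == 0 else "11"] += 1
                    else:
                        counts["01"] += 1
    return counts


def expected_joins_nonfree(image):
    """Expected join counts under non-free sampling (assumed formulas)."""
    rows, cols = len(image), len(image[0])
    n = rows * cols
    n1 = sum(sum(row) for row in image)  # pixels coded 1
    n0 = n - n1                          # pixels coded 0
    total_joins = rows * (cols - 1) + cols * (rows - 1)
    pairs = n * (n - 1)
    return {
        "11": total_joins * n1 * (n1 - 1) / pairs,
        "00": total_joins * n0 * (n0 - 1) / pairs,
        "01": 2 * total_joins * n1 * n0 / pairs,
    }


# Hypothetical 2 x 3 difference image.
image = [[0, 0, 1], [0, 1, 1]]
observed = join_counts(image)
expected = expected_joins_nonfree(image)
```

An observed number of like joins (0-0 or 1-1) well above its expected 
value suggests that errors are spatially clustered rather than 
randomly scattered. 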

Combining the results of the spatial autocorrelation analysis 
with the results of the sampling simulations allows better interpre- 
tation of when to use which sampling scheme. By being able to examine 
the spatial pattern of the errors within the difference image, the 
results of the sampling simulation can be better explained. 

5.0 Conclusions 

The research conducted at Virginia Tech over the last three 
years has contributed significantly to the accuracy assessment of 
remotely sensed data. The application of discrete multivariate 
analysis to accuracy assessment has provided a new technique for 
analyzing accuracy. The research in sampling simulation and spatial 
autocorrelation is yet to be completed. However, it is felt that 
combining the results of sampling simulation with the knowledge gained 
from spatial autocorrelation will yield better recommendations for 
which sampling schemes to use when. 


The Use of Two Model Fitting Procedures for 
Determining Associations Between Four Spatial Variables 

ABSTRACT: This study describes the use of the log-linear and 
logit model fitting procedures which yield the best fitting model for 
determining associations between four spatially defined variables. 
These variables include (1) interspersion, (2) cover type, (3) aspect, 
and (4) elevation. 


Introduction 

A great deal of information has been generated concerning 
interactions of physical variables such as aspect and elevation on 
biotic variables such as cover type (Barbour et al. 1980). Knowledge 
of such interactions can be used as predictive tools in forestry and 
wildlife habitat analysis. Statistical procedures have been developed 
which can be used to determine associations between cross-classified 
categorical variables (Fienberg 1980; Bishop et al. 1975). Since most 
spatial variables may be readily grouped into categories (such as low, 
medium, and high elevations, or south vs. north facing slopes, etc.), 
these statistical procedures represent important analytic tools for 
resource managers. 

The purpose of this study was to determine associations between 
four such spatially defined variables. These included interspersion, 
cover type, aspect, and elevation. The interspersion index used in 
this study was described by Mead et al. (1981). It may be used as an 
index of habitat quality for any selected species. Thus, using the 
statistical procedures outlined here, associations between habitat 
quality, elevation, aspect, and cover type may be determined. 

Study Area 

The study area included 1542 acres situated in the northwest 
section of the Rampart Hills Quadrangle in the San Juan National Forest, 
Colorado. This area was chosen because recent cover type information 
was available and it has been used in previous habitat studies. 

Methods and Procedures 


Data Acquisition 

The cover type information was provided by the Forest Service, 
and was originally derived from Landsat imagery. The one acre Landsat 
pixels were grouped into three acre sampling units for the purpose of 
cover type mapping. This work was conducted by the Lockheed Corporation 
under Forest Service contract. The 19 categories present in the original 
work were collapsed into seven categories which represent important 
habitat components for mule deer (Odocoileus hemionus). This collapsing 
procedure followed recommendations of Forest Service personnel (Cook, 
personal communication). Of these seven categories, four were present 
in the study area. These included (1) conifer, (2) oak, (3) aspen, 
and (4) other. Categories which do not form critical habitat components 
for the species of interest were grouped into the fourth category. 

Interspersion was calculated for each cell (3-acre sampling unit) 
in the study area. The method of calculation was described by Mead et al. 
(1981) and by Heinen et al. (1981). Two categories of interspersion 
were used in this study. These included low (0 to 0.5) and high (>0.5 
to 1.0). 

A grid representing each cell in the study was drawn directly onto 
a copy of the U. S. Geological Survey 7-1/2 minute Rampart Hills Quadrangle. 
The elevation and aspect information for each cell was then obtained 
directly from this map by following the contour lines. Two categories 
for each variable were used. The categories used for the elevation 
variable included low (≤8920 ft) and high (>8920 ft). This arbitrary 
cutoff point was chosen because this contour line roughly divided the 
study area into two equal parts. The use of only two elevational categories 
is justified because there is only an approximate 1400 foot range in 
elevation over the entire study area. In many cases the 8920 foot contour 
transected individual cells. In these cases, a judgment was made as to 
whether more than one-half of the cell was above or below the cutoff 
point, and the cell was then categorized accordingly. 

Each cell was grouped into one of two categories of aspect. These 
included south and north facing slopes. These two categories were chosen 
because of their ecological significance (Spurr and Barnes 1973), and also 
because the ridges in the study area are oriented approximately from 
southwest to northeast. This topography greatly facilitated the categori- 
zation of aspect for each cell. 

The data set is presented in Table I. For the purposes of the 
statistical notation (explained below) the number assigned to each variable 
is important. These are (1) interspersion, (2) cover type, (3) aspect, 
and (4) elevation. Numbers are likewise assigned to each category within 
each variable. Variable 1 (interspersion) has two categories, (i) low 
and (ii) high, and variable 2 has four categories denoted as (i) conifer, 
(ii) oak, (iii) aspen, and (iv) other. Variables 3 (aspect) and 4 (elevation) 
each have two categories. In the case of the former these include 
(i) north-facing slopes and (ii) south-facing slopes. In the case of the 
latter these include (i) low elevation (≤8920 ft) and (ii) high elevation 
(>8920 ft). The number designated in each cell of Table I indicates the 
total number of cells in the study area which fall into that particular 
combination of variable categories. For example, a total of nine cells in 
the study area had low interspersion and were dominated by conifer on 
north-facing slopes at elevations at or below 8920 feet. 

Table I. The number of cells in the study area described in each of 
32 possible combinations of the four categorical variables. 

                                               Aspect (3)
                                        North             South
                                     Elevation (4)     Elevation (4)
Interspersion (1)   Cover Type (2)    Low    High       Low    High

Low (i)             Conifer (i)         9       0        15       4
                    Oak (ii)           23      24        38      40
                    Aspen (iii)         5       6        40      11
                    Other (iv)          3       3         7      50

High (ii)           Conifer (i)         6       3        13       7
                    Oak (ii)           10      21        23      33
                    Aspen (iii)        10      12        23      13
                    Other (iv)          3      13         9

Statistical Procedures 

The logit and log-linear model fitting procedures were both used in 
this study. In each case the objective is to determine the simplest 
model which adequately fits the data. Each procedure is outlined here. 

Log-linear Model Fitting 

The first step in this procedure is to test all uniform order models. 
These may be defined as models which contain all n-way interactions where 
n ranges from one to the number of variables (Fienberg 1980). The uniform 
order model is denoted by t*. If t* = 3, for example, this indicates 
that all 3-way interaction terms are present in the model. All lower 
order terms are also present by default. Thus in the example above, 
all 2-way and 1-way terms would also be included in the model. 

After all uniform order models are tested, one of the stepwise 
model fitting procedures may be used. These are the forward and the 
backward procedures. In the case of the former, the researcher chooses 
the uniform order model which provides a poor fit (p < .05) where the 
next higher uniform order model provides a good fit (p > .05). In this 
study, for example, t* = 1 (no interaction terms) provided an extremely 
poor fit (p < .005), t* = 2 provided a poor fit (p < .025), and t* = 3 
provided a good fit (.5 > p > .25). Thus t* = 2 was chosen to begin the 
forward stepwise model building procedure. The next step involves 
adding the next higher order interaction terms to the model one by one, 
and the resulting model which yields the highest p-value is chosen. This 
is repeated until all significant terms are included in the model. In 
each case, the criterion for testing models is the likelihood ratio 
statistic ($G^2$), which has an asymptotic $\chi^2$ distribution. The 
critical value for testing each model may therefore be obtained from a 
$\chi^2$ table using the proper degrees of freedom. 
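
To make the likelihood ratio statistic concrete, the sketch below 
computes it for the simplest case: a two-way table tested against the 
model of independence (no interaction term). The multi-way models 
fitted in this study require iteratively fitted expected values, which 
are not shown. The function name and the example counts are 
hypothetical. 

```python
import math


def g_squared_independence(table):
    """Likelihood ratio statistic and degrees of freedom for a two-way table.

    Expected counts come from the independence model:
    expected = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # a zero cell contributes nothing to the sum
                g2 += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df


# Hypothetical 2 x 2 table of cell counts.
g2, df = g_squared_independence([[30, 10], [10, 30]])
```

The resulting value is compared with the tabled critical value for the 
given degrees of freedom; a value above the critical value indicates 
that the model without the interaction term fits poorly. 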

The backward selection procedure begins with the uniform order model 
which provides a good fit (p > .05) where the next lower uniform order 
model provides a poor fit (p < .05). In this study the uniform order model 
t* = 3 was chosen to begin the backward procedure. Successive 3-way 
interaction terms are then dropped from the model. Each resulting 
model is again tested with the $G^2$ statistic. 

Logit Fitting Procedure 

When using the logit model fitting procedure, the researcher 
first assumes that there is one response variable, and all other variables 
are explanatory. The term denoting the interaction of all explanatory variables 
is therefore present in every model tested. In this study it was 
assumed that interspersion is a response of cover type, aspect, and 
elevation. Thus the term denoting the interaction of variables 2, 3, and 
4 was present in every model tested. Choosing the proper uniform order 
model when using the logit procedure is similar to that described for 
the log-linear procedure, as are the forward and backward model selection 
procedures. 

All model selection procedures described here were conducted 
using an interactive APL computer program written by Dr. S. K. Lee of 
the Department of Statistics, VPI & SU. See Fienberg (1980) and 
Bishop et al. (1975) for a more thorough description of these statistical 
procedures. 


Results and Discussion 

Using forward and backward stepwise model selections for both the 
log-linear and logit procedures, the best fitting model is as follows: 

[1 2 3] [2 3 4] [1 4] 

This indicates that variables 1 (interspersion), 2 (cover type), 
and 3 (aspect) all interact jointly. Variables 2 (cover type), 3 (aspect), 
and 4 (elevation) also interact jointly. The additional 2-way interaction 
term of variables 1 (interspersion) and 4 (elevation) is also a significant 
feature of the model. Thus, all possible 2-way interaction terms 
are present by default in this model. 

In order to analyze the particular associations represented by 
each individual interaction term, tables were prepared for each term by 
summing across the variable(s) not present in that term (Tables 2, 3, 
and 4). In preparing the table for the [1 2 3] interaction term (Table 2), 
for example, the raw data presented in Table I was summed across variable 
4, which is not present in that term. The numbers obtained by this 
summation were then converted into proportions by dividing each by the 
sum of all numbers on the table. In this way, the type of association 
between variables may be readily determined by comparing the proportions. 
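
The collapsing step described above can be sketched as follows. The 
cell counts in the example are invented for illustration and are not 
the Table I data; the function names and the use of a dictionary keyed 
by (interspersion, cover type, aspect, elevation) are also assumptions. 

```python
def collapse(counts, keep):
    """Sum cell counts over every variable whose position is not in `keep`."""
    collapsed = {}
    for cell, count in counts.items():
        key = tuple(cell[i] for i in keep)
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed


def proportions(collapsed):
    """Convert collapsed counts to proportions of the table total."""
    total = sum(collapsed.values())
    return {key: count / total for key, count in collapsed.items()}


# Hypothetical counts keyed by (interspersion, cover type, aspect, elevation).
counts = {
    ("low", "oak", "north", "low"): 4,
    ("low", "oak", "north", "high"): 6,
    ("low", "oak", "south", "low"): 10,
    ("high", "oak", "north", "low"): 5,
    ("high", "aspen", "south", "high"): 15,
}
# Keep positions 0 and 3 (variables 1 and 4) for the [1 4] term.
table_14 = proportions(collapse(counts, keep=(0, 3)))
```

Keeping positions 0, 1, and 2 instead would give the table for the 
[1 2 3] term, summed across elevation. 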

Table 2. The observed frequencies describing the [1 2 3] interaction 
term. 

                Low Interspersion             High Interspersion
           North    South               North    South
           Slope    Slope    Total      Slope    Slope    Total

Conifer     .03      .07      .10        .04      .09      .13
Oak         .17      .28      .45        .14      .25      .38
Aspen       .04      .18      .22        .10      .16      .26
Other       .02      .21      .23        .07      .16      .23

Total       .26      .74                 .35      .65


Table 2 indicates a general trend toward lower interspersion on 
south-facing than on north-facing slopes. Lower interspersion values 
also tend to be associated with oak stands, whereas higher interspersion 
values tend to be associated with conifer and aspen stands. However, 
aspen, as well as cells designated as "other", tend to be associated 
with low interspersion values on south-facing slopes. 


Table 3. Observed frequencies describing the [2 3 4] interaction term. 

                  Low Elevation                 High Elevation
           North    South               North    South
           Slopes   Slopes   Total      Slopes   Slopes   Total

Conifer     .06      .12      .18        .01      .04      .05
Oak         .14      .26      .40        .17      .27      .44
Aspen       .06      .27      .33        .07      .09      .16
Other       .03      .06      .09        .06      .29      .35

Total       .29      .71                 .31      .69

Table 3 indicates that conifer and aspen are more prevalent 
at low elevations, while oak and the "other" category are more 
prevalent at higher elevations. The conifer and aspen are more common 
on south vs. north-facing slopes in these lower elevations. Cover type 
4 (other) is more common at higher elevations on south vs. north-facing 
slopes. It must be pointed out that we may expect different trends for 
different conifer species. Spruce and fir, for example, may be expected 
to grow more readily at higher elevations on north-facing slopes. The 
data base used here, however, collapsed categories according to the 
habitat requirements of mule deer, and thus all conifers were included 
in one category. Trends for different conifer species could be readily 
determined using the same procedures had these categories not been 
collapsed. 

Table 4. Observed frequencies describing the [1 4] interaction term. 

                    Low              High
                    Interspersion    Interspersion

Low Elevation          .28              .19
High Elevation         .27              .26

Total                  .55              .45

Table 4 indicates that higher interspersion is generally associated 
with higher elevations, and low interspersion is more common at low 
elevations. 





The information obtained from the model building procedures 
presented here has utility as a predictive tool. Managerial decisions 
may be based in part on such information. For example, clearcuts or 
prescription burns could be placed more appropriately within specific 
elevational and slope regimes to achieve a generally higher 
interspersion index throughout an area if this is desirable. 

Of particular interest in this scenario is the interaction of 
interspersion with the other variables. Knowledge of specific 
interactions can be used to stratify areas according to their habitat 
potential for a selected species resulting from the effects of cover 
type, elevation, and aspect. One important point is that the model 
which resulted in this case can only be applied to this area and to 
this species (mule deer). This is because of the cover type collapsing 
procedure. If such information is desired for other areas or species, 
a different collapsing procedure may be necessary. 

One major advantage of this technique is that, after initial data 
generation, the model building procedure itself is a rather rapid process 
when using an interactive program. The entire process of forward and 
backward log-linear and logit model selection procedures took only 1-1/2 
hours to complete on the computer terminal. This time would vary depending 
on the experience of the researchers with interactive programs, as well 
as on the number of factors to be analyzed in the model, but it is 
generally a rapid process. 

Other spatial variables of interest could easily be tested. 
Some of these may include water availability, slope (steepness), or 
human disturbances. These could easily be categorized appropriately 
and used as additional dimensions in the multi-way classification 
in determining the effects of each on habitat quality. Such procedures 
are thus readily expandable and can easily test the simultaneous 
effects of any number of variables which may affect the aspects of 
habitat quality analyzed here. 


Appendix II 


Spatial Autocorrelation Statistics for the 
Non-free Sampling Binary Classification Case* 


*Moran, P.A.P. 1948. The interpretation of statistical maps. 
Journal of the Royal Statistical Society, Series B. Vol. 10. 
pp. 243-251. 

Notation 

$\mu'_1(X)$ = The first moment of X about the origin, the expected 
value of X. 

$\mu_2(X)$ = The second moment of X about the mean, the variance 
of X. 

$n^{(b)} = n(n-1) \cdots (n-b+1)$ 

$n$ = The total number of individuals in the population. 

$n_1$ = The number of individuals in the population with the 
characteristic of interest. 

$n_2$ = The number of individuals in the population without the 
characteristic of interest. 

$L_i$ = The number of individuals joined with the $i$th individual. 

$A = \frac{1}{2} \sum_i L_i$ 

$D = \frac{1}{2} \sum_i L_i (L_i - 1)$ 

$\delta_{ij}$ = 1, if the $i$th and $j$th areas are joined; 
0, otherwise 

$X_i$ = 1, if area $i$ is correctly classified; 
0, otherwise 